ANOVA for Unbalanced Data: An Overview

Authors

  • Ruth G. Shaw
  • Thomas Mitchell-Olds
Abstract

Ecological studies typically involve comparison of biological responses among a variety of environmental conditions. When the response variables have continuous distributions and the conditions are discrete, whether inherently or by design, then it is appropriate to analyze the data using analysis of variance (ANOVA). When data conform to a complete, balanced design (equal numbers of observations in each experimental treatment), it is straightforward to conduct an ANOVA, particularly with the aid of the numerous statistical computing packages that are available. Interpretation of an ANOVA of balanced data is also unambiguous. Unfortunately, for a variety of reasons, it is rare that a practicing ecologist embarks on an analysis of data that are completely balanced. Regardless of its cause, lack of balance necessitates care in the analysis and interpretation. In this paper, our aim is to provide an overview of the consequences of lack of balance and to give some guidelines to analyzing unbalanced data for models involving fixed effects. Our treatment is necessarily cursory and will not substitute for training available from a sequence of courses in mathematical statistics and linear models. It is intended to introduce the reader to the main issues and to the extensive statistical literature that deals with them.

Footnotes: For reprints of this Special Feature, see footnote 1, p. 1615. Present address: Department of Ecology, Evolution and Behavior, University of Minnesota, St. Paul, Minnesota 55108 USA.
Running head: September 1993, STATISTICAL METHODS FOR ECOLOGY.

ANOVA AND BOONS OF BALANCE

In this section we briefly review ANOVA, noting the advantages of a strictly balanced design. For single factor analyses, lack of balance does not present serious problems (Milliken and Johnson 1984:127). We therefore discuss the two-way factorial design with the effects of both factors considered fixed, because it is among the simplest that reveals the main distinctions between balanced and unbalanced cases. The principles we present hold, in general, for more complicated models involving fixed effects. Consideration of mixed and random effects models, in which interest focuses on the variance of effects, rather than on estimates of the effects themselves, is appreciably more complex and, for this reason, is beyond the scope of this paper, but a treatment of this topic can be found in textbooks on the subject (Searle 1971, 1987, Milliken and Johnson 1984; see also Shaw 1987a for a consideration of analysis of quantitative-genetic data).

In the balanced two-way factorial design, there are two treatment factors (A and B), each having several different states or levels. All the possible combinations of the $n_a$ levels of Factor A with the $n_b$ levels of Factor B are generated, $n_a \times n_b = p$, and one treatment combination is applied to each of the $N = p \times r$ experimental units, where r is the number of observations per treatment combination, or cell. One or more measures (y) are taken on each experimental unit.

Given such a design, the means model,

$$y_{ijk} = \mu_{ij} + e_{ijk}, \qquad (1)$$

where $\mu_{ij}$ is the mean of the ijth cell of the factorial design, $e_{ijk}$ is the deviation of the kth observation in the ijth cell from the mean of that cell, $i = 1$ to $n_a$, $j = 1$ to $n_b$, and $k = 1$ to $r$, expresses the individual observations in terms of the cell means.

An alternative model, the effects model,

$$y_{ijk} = \mu + a_i + b_j + t_{ij} + e_{ijk}, \qquad (2)$$

where $\mu$ is the grand mean, $a_i$ ($b_j$) is the additive contribution of the ith (jth) level of Factor A (B) on the response, $t_{ij}$ is the deviation of the mean of the ijth cell from the sum of the ith and jth marginal means, $e_{ijk}$ is the deviation of the kth observation in the ijth cell from the mean of that cell, $i = 1$ to $n_a$, $j = 1$ to $n_b$, and $k = 1$ to $r$, describes the effects of the treatment states on the responses. Whereas the means model has p parameters, one for each cell mean, the effects model has $q = (1 + n_a + n_b + p)$ parameters. With balanced data,
the effects model often corresponds to factors or concepts being put to experimental test. However, it may be necessary to rely on the means model for analysis of some unbalanced data sets. In either model, it is assumed, for the purposes of hypothesis testing, that the $e_{ijk}$ are independent of one another and identically distributed in a Normal distribution having mean 0 and some variance, $\sigma^2$.

Regardless of the choice of parameterization, either model can be expressed conveniently in matrix notation as:

$$y = Xb + e$$

where y is the $N \times s$ matrix of s responses observed for each of N individuals, X is the $N \times r$ "design matrix" (here r = p or q, depending on the parameterization, and is not the number of observations per cell), b is an $r \times s$ matrix of the parameters of either of the two models above, and e is an $N \times s$ matrix of residuals. In the univariate case, s = 1, and y, b, and e are column vectors. The design matrix contains known constants denoting the contributions of particular parameters to the expected value of an individual. We usually define the element $X_{ij} = 1$ if the jth parameter contributes to the expected value of the ith individual, and $X_{ij} = 0$ otherwise. Note that the interpretation of the elements of b depends on the parameterization.

The least squares estimate of the set of parameters, b, is

$$b = (X^T X)^{-} X^T y$$

The superscripts, T and $-$, indicate matrix transpose and generalized inverse (Searle 1971), respectively. This representation reveals that ANOVA can be viewed as a familiar problem of multiple regression analysis. When the means model (Eq. 1) is used, the solution b, consisting of the estimates of the cell means, is readily obtained. In contrast, the effects model (Eq. 2) is over-parameterized (that is, there are more parameters than can be estimated from the available information), so the $X^T X$ matrix is singular, and infinitely many solutions, b, exist. This problem can be resolved by imposing restrictions on the parameters of Eq. 2 (Searle 1971, 1987, Milliken and Johnson 1984:Chapter 6). Differing restrictions produce distinct estimates of b.
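As a rough illustration of the two parameterizations, the following Python sketch builds both design matrices for a small 2 x 2 data set with unequal cell sizes. The data values and every function name are hypothetical, chosen only for this example, and the sketch assumes that each cell contains at least one observation. Under that assumption it suggests why the means model is readily solved (its $X^T X$ is diagonal, so the least squares solution is just the cell means) while the effects model has more columns than its rank.

```python
from fractions import Fraction
from itertools import product

def means_design(cells, n_a, n_b):
    """Means model: one indicator column per cell (i, j)."""
    all_cells = list(product(range(n_a), range(n_b)))
    return [[1 if cell == c else 0 for c in all_cells] for cell in cells]

def effects_design(cells, n_a, n_b):
    """Effects model: columns for mu, each a_i, each b_j, and each t_ij."""
    all_cells = list(product(range(n_a), range(n_b)))
    rows = []
    for i, j in cells:
        row = [1]
        row += [1 if i == a else 0 for a in range(n_a)]
        row += [1 if j == b else 0 for b in range(n_b)]
        row += [1 if (i, j) == c else 0 for c in all_cells]
        rows.append(row)
    return rows

def rank(matrix):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in matrix]
    found = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [x - factor * p for x, p in zip(m[r], m[found])]
        found += 1
        if found == len(m):
            break
    return found

def means_model_estimates(cells, y, n_a, n_b):
    """Least squares for the means model. Each row of X has a single 1,
    so X^T X is diagonal (the cell counts) and b reduces to the cell means."""
    X = means_design(cells, n_a, n_b)
    counts = [sum(col) for col in zip(*X)]                            # diagonal of X^T X
    totals = [sum(x * v for x, v in zip(col, y)) for col in zip(*X)]  # X^T y
    return [Fraction(t, n) for t, n in zip(totals, counts)]

# Hypothetical unbalanced data: cell sizes 2, 1, 1, 3.
cells = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (1, 1)]
y = [3, 5, 4, 6, 7, 8, 9]

b_means = means_model_estimates(cells, y, 2, 2)
X_effects = effects_design(cells, 2, 2)
q = len(X_effects[0])
print("cell means:", [str(v) for v in b_means])
print("effects model: q =", q, "columns, rank =", rank(X_effects))
```

Exact rational arithmetic is used here only to keep the rank computation free of rounding questions; a numerical linear algebra library would normally be used for real data.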
Regardless of the choice of restrictions, however, identical estimates are obtained for the estimable functions, linear combinations of the model parameters that by definition do not depend on the restrictions set on the parameters. For example, in the effects model specified above for the two-way crossed design (Eq. 2), estimable functions for each of the p cell means are obtained by summing the elements of b that estimate each effect contributing to a given mean (i.e., $\mu + a_i + b_j + t_{ij}$). The important issues of estimability of the parameters in the effects model and estimable functions are considered in more detail by Milliken and Johnson (1984) and Searle (1987).

The investigator usually wishes to answer each of the following questions by testing the corresponding null hypothesis, $H_0$:

1) Does the effect of one factor on the response variable(s) depend on the level of the other factor? $H_0$: There is no interaction between Factor A and Factor B. This null hypothesis is expressed as $\mu_{ij} - \mu_{i'j} - \mu_{ij'} + \mu_{i'j'} = 0$ for all $i, i', j, j'$, with $'$ indicating distinct states of a factor (means model), or as $(t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}) = 0$ (effects model), where a dot as a subscript indicates averaging over the levels of a given factor.

2) Do the levels of Factor A differ in their effects on the response variable(s)? $H_0$: There is no main effect of Factor A on the response. In the means model, this null hypothesis is given as $\bar{\mu}_{1.} = \bar{\mu}_{2.} = \dots = \bar{\mu}_{n_a.}$. The same hypothesis can be expressed in the effects model as all $(a_i + \bar{t}_{i.})$ are equal.

3) Do the levels of Factor B differ in their effects on the response variable(s)? $H_0$: There is no main effect of Factor B on the response. This null hypothesis can be expressed in terms of the parameters as $\bar{\mu}_{.1} = \bar{\mu}_{.2} = \dots = \bar{\mu}_{.n_b}$ (means model) or all $(b_j + \bar{t}_{.j})$ are equal (effects model).

Note that all hypotheses can be expressed in terms of either the means model or the effects model.
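The idea of an estimable function can be made concrete with a small Python sketch. It takes hypothetical cell means for a 2 x 2 design and solves the effects model under two common restrictions, a reference-cell (set-to-zero) restriction and a sum-to-zero restriction. Both restrictions, the data, and all names are assumptions made for this illustration. The individual parameters differ between the two solutions, but the cell-mean function $\mu + a_i + b_j + t_{ij}$ and the interaction contrast from question 1 should come out the same.

```python
from fractions import Fraction

# Hypothetical cell means for a 2 x 2 design, keyed by (i, j).
m = {(0, 0): Fraction(4), (0, 1): Fraction(4),
     (1, 0): Fraction(6), (1, 1): Fraction(8)}
levels = (0, 1)

def reference_cell(m):
    """Restriction: every effect that involves level 0 is set to zero."""
    mu = m[0, 0]
    a = [0, m[1, 0] - m[0, 0]]
    b = [0, m[0, 1] - m[0, 0]]
    t = {(i, j): Fraction(0) for i in levels for j in levels}
    t[1, 1] = m[1, 1] - m[1, 0] - m[0, 1] + m[0, 0]
    return mu, a, b, t

def sum_to_zero(m):
    """Restriction: a_i, b_j, and t_ij each sum to zero over every index."""
    mu = sum(m.values()) / 4
    row = [(m[i, 0] + m[i, 1]) / 2 for i in levels]
    col = [(m[0, j] + m[1, j]) / 2 for j in levels]
    a = [r - mu for r in row]
    b = [c - mu for c in col]
    t = {(i, j): m[i, j] - row[i] - col[j] + mu for i in levels for j in levels}
    return mu, a, b, t

def cell_mean(params, i, j):
    """Estimable function mu + a_i + b_j + t_ij."""
    mu, a, b, t = params
    return mu + a[i] + b[j] + t[i, j]

def interaction_contrast(params):
    """Estimable function t_00 - t_01 - t_10 + t_11."""
    t = params[3]
    return t[0, 0] - t[0, 1] - t[1, 0] + t[1, 1]

ref, stz = reference_cell(m), sum_to_zero(m)
print("mu under each restriction:", ref[0], stz[0])
for i in levels:
    for j in levels:
        print((i, j), cell_mean(ref, i, j), cell_mean(stz, i, j))
print("interaction contrast:", interaction_contrast(ref), interaction_contrast(stz))
```

The sketch is limited to two levels per factor to keep the restricted solutions short enough to write out by hand.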
Available statistical packages are based on the effects model, but Milliken and Johnson (1984) demonstrate the utility of the means model, particularly for unbalanced designs. They also show how standard packages can be used to conduct ANOVA in terms of the means model.

The analysis of variance procedure is so named because it breaks down ("analyzes") the variance (actually, the total sum of the squared deviations of the responses from their grand mean [i.e., $SS_T = (N - 1)$ times the variance]) into terms that quantify the magnitude of the overall variance in the response variable(s) attributable to the factors of the design, to their interaction, and to residual variability within cells of the design. There are several distinct methods for accomplishing the breakdown of the $SS_T$ into the SS for the different factors. Computing formulae are available in many statistical texts (e.g., Milliken and Johnson 1984:Chapters 9 and 10) and need not be repeated here. In the balanced case, as defined above, the methods yield identical results, and interpretation is therefore straightforward. When the design is balanced, moreover, the sums of squares corresponding to each of the factors, to their interaction, and to the residual variance are independent of one another, and these sums of squares are distributed according to the noncentral $\chi^2$ distribution with their respective degrees of freedom (df).

[Running head: SPECIAL FEATURE, Ecology, Vol. 74, No. 6]

In the cases we are considering, a model having only fixed effects, the "importance" of a given factor is usually judged by comparison of the variance attributable to that factor to the residual (within-cell) variability. Thus, tests of each null hypothesis are developed by constructing the ratio of the mean square (MS = SS/df) for the particular effect to the residual MS. (We urge caution here, however.
With certain designs (e.g., nested, split-plot), tests of particular effects require a denominator other than the residual MS: see Milliken and Johnson 1984:Chapters 5 and 24-32.) When the null hypothesis holds, then, given a balanced design, this ratio of MS is distributed exactly according to the F distribution.

When the design is unbalanced, the distinct methods of partitioning $SS_T$ do not produce the same results, the resulting SS associated with the two factors and their interaction are not necessarily independent of one another, and the ratio of MS is no longer exactly distributed according to the F distribution. Thus, loss of balance causes ambiguities that plague the processes of estimating the parameters, partitioning the SS, and testing the hypotheses of interest. Although this is certainly a discouraging situation, a careful analysis can often overcome these impediments and can provide a reasonably clear picture of the biology embodied in the data.

We use the term "balance" to refer collectively to several distinct attributes of data structures. Balance can therefore be compromised in several different ways, which we describe in this section. Given the balanced two-way factorial design described above, there are three ways that balance can be marred: (1) the numbers of observations for the different treatment combinations may be unequal, (2) some of the cells (treatment combinations) may be missing altogether, and (3) in multivariate data, some of the experimental units may have been measured for only a subset of the response variables. We consider these in turn.

Unequal sample size

Probably the most common way in which data are unbalanced is by inequality of numbers of observations per cell. When the unit of observation is an individual organism, then mortality, emigration, or inability to relocate individuals for measurement can affect numbers representing a given treatment.
Even when the unit of observation is a group of organisms, it is sometimes necessary to eliminate units due to accidents during application of the treatments or during measurement. If properties of the treatments are likely to be causally related to the variation in sample size among cells, then analysis of only the available responses (e.g., of survivors), ignoring the missing observations, would reveal only part of the effect (or even obscure the effect) of the treatments applied. (See Little and Rubin 1987:8-9, and Maxwell and Delaney 1989:273, for further discussion of this point.) One way of achieving a more complete picture of the overall response to the treatments is to use categorical methods to explicitly analyze the effects of the treatments on final numbers of individuals in each cell (e.g., Shaw 1986, 1987b). The necessity of separately analyzing the realized cell sizes and the responses measured on remaining individuals is unfortunate, because the two analyses cannot be regarded as independent, and no joint analysis is readily available. However, we are optimistic that current research in theoretical statistics (e.g., Little and Rubin 1987:Chapters 11 and 12) will eventually permit joint analysis of the pattern of missing data together with variables measured on the remaining experimental units. In the following, we assume that the treatments do not directly cause the variation in sample size or that the variation in sample size is analyzed separately.

Regardless of the cause of the variation in numbers of observations, its consequence is that there are more observations for some combinations of levels of the factors, and hence more information on the effect of these combinations, than for other combinations. That is, the levels of the factors, often called "independent variables," are not independent in the realized data. As a result, the estimates and tests of the effects of factors are also not generally independent.
Thus, the lack of balance impairs the ability of the experiment to accomplish the usual aim of such studies: that of distinguishing the effects of the factors. A related consequence of inequality of sample size is that the various methods of computing ss statistics no longer yield identical results. Interpretations based on the diverse methods can differ profoundly, and the method to be preferred is often not obvious.
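One way to see the disagreement among methods is to compute a main-effect sum of squares twice: once with the factor fitted first, and once with it fitted after the other factor. The Python sketch below does this for an additive model (no interaction term) with two-level factors, using exact least squares on hypothetical data. The data, the function names, and the choice of sequential fitting as the pair of "methods" are assumptions made for illustration, not the paper's own computations. With equal cell sizes the two orders should agree; with unequal cell sizes they generally do not.

```python
from fractions import Fraction

def solve(A, rhs):
    """Solve A x = rhs exactly by Gauss-Jordan elimination (A nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * p for x, p in zip(M[r], M[col])]
    return [row[-1] for row in M]

def rss(X, y):
    """Residual sum of squares after least squares regression of y on X."""
    k = len(X[0])
    xtx = [[sum(row[u] * row[v] for row in X) for v in range(k)] for u in range(k)]
    xty = [sum(row[u] * yk for row, yk in zip(X, y)) for u in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bc * xc for bc, xc in zip(beta, row)) for row in X]
    return sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))

def design(cells, use_a, use_b):
    """Intercept plus 0/1 codes for two-level factors A and B (no interaction)."""
    return [[1] + ([i] if use_a else []) + ([j] if use_b else []) for i, j in cells]

def sequential_ss(cells, y):
    """SS for each main effect, fitted first or after the other factor."""
    r0 = rss(design(cells, False, False), y)
    ra = rss(design(cells, True, False), y)
    rb = rss(design(cells, False, True), y)
    rab = rss(design(cells, True, True), y)
    return {"A first": r0 - ra, "A after B": rb - rab,
            "B first": r0 - rb, "B after A": ra - rab}

# Hypothetical data: two observations per cell, then cell sizes 3, 1, 1, 3.
balanced_cells = [(0, 0)] * 2 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)] * 2
balanced_y = [3, 5, 5, 7, 6, 8, 8, 10]
unbalanced_cells = [(0, 0)] * 3 + [(0, 1)] + [(1, 0)] + [(1, 1)] * 3
unbalanced_y = [3, 4, 5, 6, 6, 8, 9, 10]

balanced = sequential_ss(balanced_cells, balanced_y)
unbalanced = sequential_ss(unbalanced_cells, unbalanced_y)
for label, result in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(label, {name: str(value) for name, value in result.items()})
```

In the unbalanced data set the A and B codes are correlated across observations, so part of the variation can be credited to either factor, and the order of fitting decides which one receives it.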

Publication date (site listing): 2007